Marketing can help businesses increase brand awareness, engagement, and sales through promotional campaigns. No matter what area a business focuses on, it can take advantage of the benefits marketing offers and expand its reach.
One way for a marketing team to understand its customers is to divide them by shared characteristics, a process called customer segmentation. Customer segmentation is the process of dividing your customers based on common characteristics, such as demographics or behaviours, so you can market to them more effectively.
In addition, if data about the customers is available, the data science team can help perform customer segmentation.
Competition in the financial industry is expected to intensify over the next decade. One of the industry's main sources of revenue is interest income, earned by extending loans or credit payment facilities to customers. Therefore, the more credit is extended, the more interest is earned.
Since data is collected on every credit activity, the company hopes to gain insight by processing it. Here we have a dataset containing a summary of the usage behaviour of about 9,000 active credit card holders during the last 6 months. We will process this data with an unsupervised learning methodology to segment the customers by finding patterns, in the hope of identifying distinguishing characteristics for each customer segment.
We will then analyze each segment and plan the marketing approach that works best for it. We also give recommendations for future updates to the model or strategy.
We will perform exploratory data analysis to draw insights from the data. Then we will cluster the data using unsupervised learning with K-Means. After the data is segmented, we will decide a marketing approach for each segment.
We will decide the number of clusters using the Elbow Method and the Silhouette Method. In the Elbow Method, the number of clusters is chosen at the point where adding one more cluster no longer produces a significant change in the level of similarity; in the Silhouette Method, it is chosen by how close each point in one cluster is to points in the neighboring clusters.
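As a rough illustration, both criteria can be computed in a loop over candidate cluster counts; a minimal sketch on synthetic blobs (the cluster range 2-9, the blob parameters, and the random_state are illustrative choices, not values from this analysis):

```python
# Sketch of the Elbow and Silhouette methods on synthetic data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=2022)

inertias, silhouettes = {}, {}
for k in range(2, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=2022).fit(X)
    inertias[k] = km.inertia_                         # Elbow: within-cluster SSE
    silhouettes[k] = silhouette_score(X, km.labels_)  # cohesion vs. separation

best_k = max(silhouettes, key=silhouettes.get)
print('best k by silhouette:', best_k)
```

The elbow is read visually from a plot of `inertias`, while the silhouette criterion gives a direct numeric maximum.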
The dataset is obtained from: https://www.kaggle.com/datasets/arjunbhasin2013/ccdata
"The sample Dataset summarizes the usage behavior of about 9000 active credit card holders during the last 6 months. The file is at a customer level with 18 behavioral variables."
Attributes Information
| Attribute | Data Type | Description |
|---|---|---|
| cust_id | Object | Identification of Credit Card holder (Categorical) |
| balance | Float | balance amount left in their account to make purchases |
| balance_frequency | Float | How frequently the balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated) |
| purchases | Float | Amount of purchases made from account (oneoff_purchases + installments_purchases) |
| oneoff_purchases | Float | Maximum purchase amount done in one-go |
| installments_purchases | Float | Amount of purchase done in installment |
| cash_advance | Float | Cash in advance given by the user |
| purchases_frequency | Float | How frequently the purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased) |
| oneoff_purchases_frequency | Float | How frequently purchases are happening in one-go (1 = frequently purchased, 0 = not frequently purchased) |
| purchases_installments_frequency | Float | How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done) |
| cash_advance_frequency | Float | How frequently the cash advance is being paid |
| cash_advance_trx | Integer | Number of Transactions made with "Cash in Advanced" |
| purchases_trx | Integer | Number of purchase transactions made |
| credit_limit | Float | Limit of Credit Card for user |
| payments | Float | Amount of payment done by user |
| minimum_payments | Float | minimum amount of payments made by user |
| prc_full_payment | Float | Percent of full payment paid by user |
| tenure | Integer | tenure of credit card service for user in years |
# Import library
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from matplotlib import colors
import matplotlib.pyplot as plt
%matplotlib inline
from itertools import product
import missingno
# Preprocessing Data
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# Modeling
from sklearn_extra.cluster import KMedoids
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist, pdist
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering, Birch
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
# Handling Warning
import warnings
warnings.filterwarnings('ignore')
# load dataset
# df=pd.read_csv(r"C:\Users\62821\File\CC GENERAL.csv")
df = pd.read_csv('CC GENERAL.csv')
# convert feature names to lowercase
df.columns = map(str.lower, df.columns)
df.head()
| cust_id | balance | balance_frequency | purchases | oneoff_purchases | installments_purchases | cash_advance | purchases_frequency | oneoff_purchases_frequency | purchases_installments_frequency | cash_advance_frequency | cash_advance_trx | purchases_trx | credit_limit | payments | minimum_payments | prc_full_payment | tenure | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C10001 | 40.900749 | 0.818182 | 95.40 | 0.00 | 95.4 | 0.000000 | 0.166667 | 0.000000 | 0.083333 | 0.000000 | 0 | 2 | 1000.0 | 201.802084 | 139.509787 | 0.000000 | 12 |
| 1 | C10002 | 3202.467416 | 0.909091 | 0.00 | 0.00 | 0.0 | 6442.945483 | 0.000000 | 0.000000 | 0.000000 | 0.250000 | 4 | 0 | 7000.0 | 4103.032597 | 1072.340217 | 0.222222 | 12 |
| 2 | C10003 | 2495.148862 | 1.000000 | 773.17 | 773.17 | 0.0 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0 | 12 | 7500.0 | 622.066742 | 627.284787 | 0.000000 | 12 |
| 3 | C10004 | 1666.670542 | 0.636364 | 1499.00 | 1499.00 | 0.0 | 205.788017 | 0.083333 | 0.083333 | 0.000000 | 0.083333 | 1 | 1 | 7500.0 | 0.000000 | NaN | 0.000000 | 12 |
| 4 | C10005 | 817.714335 | 1.000000 | 16.00 | 16.00 | 0.0 | 0.000000 | 0.083333 | 0.083333 | 0.000000 | 0.000000 | 0 | 1 | 1200.0 | 678.334763 | 244.791237 | 0.000000 | 12 |
# drop unused columns
df_wo_id = df.drop('cust_id', axis=1)
# creating distribution plot for analysis
fig = plt.figure(figsize=(30, 25), constrained_layout=True)
for i in range(len(df_wo_id.columns)):
plt.subplot(6, 3, i+1)
sns.histplot(df_wo_id[df_wo_id.columns[i]], kde=True)
Key Points:
- purchases, oneoff_purchases, installments_purchases, and cash_advance show the same trends as balance and credit_limit.
#plot relation credit limit with balance
sns.pairplot(data=df_wo_id, x_vars='credit_limit', y_vars='balance',
height=6, aspect=1.5).map(sns.kdeplot, levels=1, color='red');
The plot shows that most users are concentrated around a credit limit below 13,000 and a balance below 8,000.
#relation plot between features
sns.pairplot(data=df_wo_id,
y_vars=['balance', 'credit_limit'],
x_vars=['purchases', 'oneoff_purchases', 'installments_purchases', 'cash_advance'],
height=6, aspect=1.4).map(sns.kdeplot, levels=1, color='red');
The plot above shows that features like purchases, oneoff_purchases, installments_purchases, and cash_advance follow trends similar to credit_limit and balance. Because most users' balance and credit limit are low, their purchases are also low.
#displaying purchase frequency above 0.5
print(len(df_wo_id[df_wo_id['purchases_installments_frequency'] > .5]), 'purchases_installments_frequency')
print(len(df_wo_id[df_wo_id['oneoff_purchases_frequency'] > .5]), 'oneoff_purchases_frequency')
3090 purchases_installments_frequency
1343 oneoff_purchases_frequency
#pie plot for installment vs oneoff
plt.figure(figsize=(7,7))
plt.pie([len(df_wo_id[df_wo_id['purchases_installments_frequency'] > .5]),
len(df_wo_id[df_wo_id['oneoff_purchases_frequency'] > .5])],
labels = ['purchases_installments_frequency',
'oneoff_purchases_frequency'],
autopct='%.0f%%')
plt.title('purchases_installments_frequency VS oneoff_purchases_frequency (>0.5)');
The plot above shows that most users make purchases in installments rather than with one-off payments.
#pair plot for payments vs minimum payments
sns.pairplot(data=df_wo_id, y_vars='payments', x_vars='minimum_payments',
height=6, aspect=1.4).map(sns.kdeplot, levels=1, color='red');
The KDE contour in the plot above shows that most users made payments below 12,000 and have minimum payments below 5,000.
Balance
# creating plot analysis
plt.figure(figsize=(18, 14))
columns=['purchases','oneoff_purchases','installments_purchases','cash_advance','credit_limit','payments']
n=1
for x in columns:
plt.subplot(2,3,n)
plt.scatter(df_wo_id['balance'], df_wo_id[x], alpha=0.25)
plt.xlabel('Balance')
plt.ylabel(x)
n=n+1
plt.ticklabel_format(useOffset=False, style='plain')
plt.show()
99% of users have a balance below 10,000, and most of them (more than around 80%):
Analysis:
Most users have a balance up to 10,000, and half of them pay using the one-off method. Since these customers already have good spending habits, we should look for ways to encourage them to pay with their credit cards.
Purchase
# creating plot analysis
plt.figure(figsize=(18, 14))
columns=['balance','oneoff_purchases','installments_purchases','cash_advance','credit_limit','payments']
n=1
for x in columns:
plt.subplot(2,3,n)
plt.scatter(df_wo_id['purchases'], df_wo_id[x], alpha=0.25)
plt.xlabel('Purchase')
plt.ylabel(x)
n=n+1
plt.ticklabel_format(useOffset=False, style='plain')
plt.show()
As observed, 99% of customers make purchases up to 10,000 and:
Analysis:
Most users who make purchases up to 10,000 have a credit limit up to 20,000. This means customers are still not using their credit cards to the fullest. We assume this occurs because the credit interest is too high, the bank's credit card is not accepted at many outlets, most users' cards are still new, or competing banks offer better credit cards.
Credit Limit
# creating plot
plt.figure(figsize=(18, 14))
columns=['oneoff_purchases','installments_purchases','cash_advance','payments']
n=1
for x in columns:
plt.subplot(2,2,n)
plt.scatter(df_wo_id['credit_limit'], df_wo_id[x], alpha=0.25)
plt.xlabel('Credit Limit')
plt.ylabel(x)
n=n+1
plt.ticklabel_format(useOffset=False, style='plain')
plt.show()
99% of customers have a credit limit up to 20,000 and:
Analysis:
Most customers already have a high credit limit. We assume there is no significant problem with the credit limit; instead, we should analyze customer behaviour.
# calculates the correlations
correlation = df_wo_id.corr(method='pearson')
# uses the variable ax for single a Axes
fig, ax = plt.subplots()
# sets the figure size in inches
ax.figure.set_size_inches(20, 20)
# generates a mask for the upper triangle
mask = np.triu(np.ones_like(correlation, dtype=bool))
# generates a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# plots the heatmap
sns.heatmap(correlation, cmap=cmap, mask=mask, square=True, linewidths=.5,
annot=True, annot_kws={'size':14})
# displays the plot
plt.show()
Very strong correlations:
Medium correlations:
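As an aside, the strongest pairs can also be pulled programmatically from the correlation matrix rather than read off the heatmap; a hedged sketch on a small toy frame standing in for `df_wo_id` (column names and data are illustrative):

```python
# Rank feature pairs by absolute Pearson correlation.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({'purchases': a,
                     'oneoff_purchases': a * 0.9 + rng.normal(scale=0.1, size=200),
                     'cash_advance': rng.normal(size=200)})

corr = demo.corr(method='pearson')
# keep only the upper triangle so each pair appears once
pairs = (corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
             .stack()
             .sort_values(key=abs, ascending=False))
print(pairs.head(3))
```

Applied to `df_wo_id.corr()`, the head of `pairs` would list the "very strong" correlations directly.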
Outliers, Anomalies, Duplicates, Missing Values, etc.
# check missing value
df_wo_id.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8950 entries, 0 to 8949
Data columns (total 17 columns):
 #   Column                            Non-Null Count  Dtype
---  ------                            --------------  -----
 0   balance                           8950 non-null   float64
 1   balance_frequency                 8950 non-null   float64
 2   purchases                         8950 non-null   float64
 3   oneoff_purchases                  8950 non-null   float64
 4   installments_purchases            8950 non-null   float64
 5   cash_advance                      8950 non-null   float64
 6   purchases_frequency               8950 non-null   float64
 7   oneoff_purchases_frequency        8950 non-null   float64
 8   purchases_installments_frequency  8950 non-null   float64
 9   cash_advance_frequency            8950 non-null   float64
 10  cash_advance_trx                  8950 non-null   int64
 11  purchases_trx                     8950 non-null   int64
 12  credit_limit                      8949 non-null   float64
 13  payments                          8950 non-null   float64
 14  minimum_payments                  8637 non-null   float64
 15  prc_full_payment                  8950 non-null   float64
 16  tenure                            8950 non-null   int64
dtypes: float64(14), int64(3)
memory usage: 1.2 MB
# Check detailed info for each feature
listItem = []
for col in df_wo_id.columns :
listItem.append([col, df_wo_id[col].dtype, df_wo_id[col].isna().sum(), round((df_wo_id[col].isna().sum()/len(df_wo_id[col])) * 100,2),
df_wo_id[col].nunique(), list(df_wo_id[col].drop_duplicates().sample(2).values)]);
dfDesc = pd.DataFrame(columns=['dataFeatures', 'dataType', 'missing value', 'nullPct', 'unique', 'uniqueSample'],
data=listItem)
dfDesc
| dataFeatures | dataType | missing value | nullPct | unique | uniqueSample | |
|---|---|---|---|---|---|---|
| 0 | balance | float64 | 0 | 0.00 | 8871 | [261.462528, 131.290079] |
| 1 | balance_frequency | float64 | 0 | 0.00 | 43 | [0.166667, 0.363636] |
| 2 | purchases | float64 | 0 | 0.00 | 6203 | [839.05, 1847.57] |
| 3 | oneoff_purchases | float64 | 0 | 0.00 | 4014 | [801.35, 53.0] |
| 4 | installments_purchases | float64 | 0 | 0.00 | 4452 | [606.51, 1255.68] |
| 5 | cash_advance | float64 | 0 | 0.00 | 4323 | [7968.273359, 366.121216] |
| 6 | purchases_frequency | float64 | 0 | 0.00 | 47 | [0.333333, 0.727273] |
| 7 | oneoff_purchases_frequency | float64 | 0 | 0.00 | 47 | [0.818182, 0.916667] |
| 8 | purchases_installments_frequency | float64 | 0 | 0.00 | 47 | [0.875, 0.25] |
| 9 | cash_advance_frequency | float64 | 0 | 0.00 | 54 | [0.7, 0.166667] |
| 10 | cash_advance_trx | int64 | 0 | 0.00 | 65 | [13, 93] |
| 11 | purchases_trx | int64 | 0 | 0.00 | 173 | [71, 6] |
| 12 | credit_limit | float64 | 1 | 0.01 | 205 | [22500.0, 15000.0] |
| 13 | payments | float64 | 0 | 0.00 | 8711 | [539.459472, 2077.205644] |
| 14 | minimum_payments | float64 | 313 | 3.50 | 8636 | [672.890662, 1452.249653] |
| 15 | prc_full_payment | float64 | 0 | 0.00 | 47 | [0.666667, 0.111111] |
| 16 | tenure | int64 | 0 | 0.00 | 7 | [11, 7] |
# Distribution plot to visualize data distribution
hist_b_ho = df_wo_id.hist(bins=50,figsize=(30,20))
hist_b_ho
plt.show()
# plot for outliers detection
plt.figure(figsize=(25,20))
for i in range(len(df_wo_id.columns)):
plt.subplot(5,4,i+1)
sns.boxplot(df_wo_id[df_wo_id.columns[i]])
plt.title(df_wo_id.columns[i])
plt.tight_layout()
# creating function to generate IQR, lower limit, and Upper limit
def find_outlier(df, feature):
print('Outlier ' + feature)
q1 = df[feature].quantile(0.25)
q3 = df[feature].quantile(0.75)
iqr = q3 - q1
limit = iqr*1.5
print(f'IQR: {iqr}')
Lower_Limit = q1 - limit
Upper_Limit = q3 + limit
print(f'Lower_Limit: {Lower_Limit}')
print(f'Upper_Limit: {Upper_Limit}')
print('_________________________')
# check IQR, upper limit, and lower limit for each feature
for i in df_wo_id :
find_outlier(df_wo_id, i)
| Feature | IQR | Lower_Limit | Upper_Limit |
|---|---|---|---|
| balance | 1925.85812 | -2760.5052645 | 4942.9272155 |
| balance_frequency | 0.111111 | 0.7222225 | 1.1666665 |
| purchases | 1070.495 | -1566.1075 | 2715.8725 |
| oneoff_purchases | 577.405 | -866.1075 | 1443.5125 |
| installments_purchases | 468.6375 | -702.95625 | 1171.59375 |
| cash_advance | 1113.82113925 | -1670.731708875 | 2784.552848125 |
| purchases_frequency | 0.833334 | -1.166668 | 2.166668 |
| oneoff_purchases_frequency | 0.3 | -0.45 | 0.75 |
| purchases_installments_frequency | 0.75 | -1.125 | 1.875 |
| cash_advance_frequency | 0.222222 | -0.333333 | 0.555555 |
| cash_advance_trx | 4.0 | -6.0 | 10.0 |
| purchases_trx | 16.0 | -23.0 | 41.0 |
| credit_limit | 4900.0 | -5750.0 | 13850.0 |
| payments | 1517.85815075 | -1893.511060125 | 4177.921542875 |
| minimum_payments | 656.361752 | -815.418921 | 1810.028087 |
| prc_full_payment | 0.142857 | -0.2142855 | 0.3571425 |
| tenure | 0.0 | 12.0 | 12.0 |
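As an aside, the per-feature limits above can also be computed in one vectorized step instead of a loop; a minimal sketch on a toy frame (column names and values are illustrative):

```python
# Compute IQR and 1.5*IQR fences for every numeric column at once.
import pandas as pd

demo = pd.DataFrame({'balance': [0, 100, 200, 300, 10_000],
                     'tenure': [12, 12, 12, 12, 6]})

q1, q3 = demo.quantile(0.25), demo.quantile(0.75)
iqr = q3 - q1
bounds = pd.DataFrame({'IQR': iqr,
                       'Lower_Limit': q1 - 1.5 * iqr,
                       'Upper_Limit': q3 + 1.5 * iqr})
print(bounds)
```

Run on `df_wo_id`, this reproduces the table of limits in a single DataFrame.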
# creating outliers columns
out1 = df_wo_id[df_wo_id['balance'] > 4942.9272155]
out2 = df_wo_id[(df_wo_id['balance_frequency'] < 0.7222225000000001) | (df_wo_id['balance_frequency'] > 1.1666664999999998)]
out3 = df_wo_id[df_wo_id['purchases'] > 2715.8725000000004]
out4 = df_wo_id[df_wo_id['oneoff_purchases'] > 1443.5124999999998]
out5 = df_wo_id[df_wo_id['installments_purchases'] > 1171.59375]
out6 = df_wo_id[df_wo_id['cash_advance'] > 2784.5528481250003]
out7 = df_wo_id[df_wo_id['purchases_frequency'] > 2.166668]
out8 = df_wo_id[df_wo_id['oneoff_purchases_frequency'] > 0.75]
out9 = df_wo_id[df_wo_id['purchases_installments_frequency'] > 1.875]
out10 = df_wo_id[df_wo_id['cash_advance_frequency'] > 0.555555]
out11 = df_wo_id[df_wo_id['cash_advance_trx'] > 10.0]
out12 = df_wo_id[df_wo_id['purchases_trx'] > 41.0]
out13 = df_wo_id[df_wo_id['credit_limit'] > 13850.0]
out14 = df_wo_id[df_wo_id['payments'] > 4177.921542875]
out15 = df_wo_id[df_wo_id['minimum_payments'] > 1810.0280870000001]
out16 = df_wo_id[df_wo_id['prc_full_payment'] > 0.35714250000000003]
out17 = df_wo_id[(df_wo_id['tenure'] != 12.0)]
# creating outliers dataframe
out_all = pd.concat([out1, out2, out3, out4, out5, out6, out7, out8, out9, out10, out11, out12, out13, out14, out15, out16, out17], axis = 0)
out_all.drop_duplicates(inplace=True)
out_all
| balance | balance_frequency | purchases | oneoff_purchases | installments_purchases | cash_advance | purchases_frequency | oneoff_purchases_frequency | purchases_installments_frequency | cash_advance_frequency | cash_advance_trx | purchases_trx | credit_limit | payments | minimum_payments | prc_full_payment | tenure | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15 | 6886.213231 | 1.000000 | 1611.70 | 0.00 | 1611.70 | 2301.491267 | 0.500000 | 0.000000 | 0.500000 | 0.166667 | 4 | 11 | 8000.0 | 1993.439277 | 2109.906490 | 0.00 | 12 |
| 21 | 6369.531318 | 1.000000 | 6359.95 | 5910.04 | 449.91 | 229.028245 | 1.000000 | 0.916667 | 1.000000 | 0.333333 | 6 | 92 | 11250.0 | 2077.959051 | 1659.775075 | 0.00 | 12 |
| 24 | 5368.571219 | 1.000000 | 0.00 | 0.00 | 0.00 | 798.949863 | 0.000000 | 0.000000 | 0.000000 | 0.363636 | 4 | 0 | 6000.0 | 1422.726707 | 1657.002877 | 0.00 | 11 |
| 28 | 7152.864372 | 1.000000 | 387.05 | 204.55 | 182.50 | 2236.145259 | 0.666667 | 0.166667 | 0.416667 | 0.833333 | 16 | 8 | 10500.0 | 1601.448347 | 1648.851345 | 0.00 | 12 |
| 30 | 12136.219960 | 1.000000 | 3038.01 | 1013.20 | 2024.81 | 3183.583301 | 1.000000 | 0.166667 | 1.000000 | 0.500000 | 10 | 64 | 13000.0 | 4230.323491 | 3292.015030 | 0.00 | 12 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8944 | 193.571722 | 0.833333 | 1012.73 | 1012.73 | 0.00 | 0.000000 | 0.333333 | 0.333333 | 0.000000 | 0.000000 | 0 | 2 | 4000.0 | 0.000000 | NaN | 0.00 | 6 |
| 8946 | 19.183215 | 1.000000 | 300.00 | 0.00 | 300.00 | 0.000000 | 1.000000 | 0.000000 | 0.833333 | 0.000000 | 0 | 6 | 1000.0 | 275.861322 | NaN | 0.00 | 6 |
| 8947 | 23.398673 | 0.833333 | 144.40 | 0.00 | 144.40 | 0.000000 | 0.833333 | 0.000000 | 0.666667 | 0.000000 | 0 | 5 | 1000.0 | 81.270775 | 82.418369 | 0.25 | 6 |
| 8948 | 13.457564 | 0.833333 | 0.00 | 0.00 | 0.00 | 36.558778 | 0.000000 | 0.000000 | 0.000000 | 0.166667 | 2 | 0 | 500.0 | 52.549959 | 55.755628 | 0.25 | 6 |
| 8949 | 372.708075 | 0.666667 | 1093.25 | 1093.25 | 0.00 | 127.040008 | 0.666667 | 0.666667 | 0.000000 | 0.333333 | 2 | 23 | 1200.0 | 63.165404 | 88.288956 | 0.00 | 6 |
5238 rows × 17 columns
#check outliers percentage
print('Percentage Outlier')
len(out_all)/len(df)*100
Percentage Outlier
58.52513966480447
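As a hedged alternative to building `out1` through `out17` by hand, a single two-sided 1.5×IQR mask can flag the outlier rows in one pass (note that the notebook's actual rules are one-sided for most features and special-cased for tenure, so the results would differ slightly); a sketch on toy data:

```python
# Flag a row as an outlier when any feature falls outside its IQR fence.
import pandas as pd

demo = pd.DataFrame({'balance': [0, 100, 200, 300, 10_000],
                     'purchases': [5, 6, 7, 8, 9]})

q1, q3 = demo.quantile(0.25), demo.quantile(0.75)
iqr = q3 - q1
outside = (demo < q1 - 1.5 * iqr) | (demo > q3 + 1.5 * iqr)
out_all = demo[outside.any(axis=1)]
print(len(out_all) / len(demo) * 100, '% outliers')
```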
Since the outliers mainly reflect the nature of the dataset, we decided not to drop them. Instead, we will try several models that can fit this dataset properly.
#Anomaly 1
df_wo_id[df_wo_id['balance'] > df_wo_id['credit_limit']]
| balance | balance_frequency | purchases | oneoff_purchases | installments_purchases | cash_advance | purchases_frequency | oneoff_purchases_frequency | purchases_installments_frequency | cash_advance_frequency | cash_advance_trx | purchases_trx | credit_limit | payments | minimum_payments | prc_full_payment | tenure | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 1809.828751 | 1.0 | 1333.28 | 0.0 | 1333.28 | 0.000000 | 0.666667 | 0.000000 | 0.583333 | 0.000000 | 0 | 8 | 1800.0 | 1400.057770 | 2407.246035 | 0.000000 | 12 |
| 10 | 1293.124939 | 1.0 | 920.12 | 0.0 | 920.12 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0 | 12 | 1200.0 | 1083.301007 | 2172.697765 | 0.000000 | 12 |
| 20 | 2016.684686 | 1.0 | 176.68 | 0.0 | 176.68 | 0.000000 | 0.666667 | 0.000000 | 0.666667 | 0.000000 | 0 | 8 | 2000.0 | 223.068600 | 13557.297260 | 0.000000 | 12 |
| 64 | 1923.886805 | 1.0 | 1887.64 | 0.0 | 1887.64 | 1605.949371 | 1.000000 | 0.000000 | 1.000000 | 0.083333 | 4 | 61 | 1850.0 | 3688.045998 | 3266.996777 | 0.083333 | 12 |
| 78 | 1205.716678 | 1.0 | 0.00 | 0.0 | 0.00 | 38.325801 | 0.000000 | 0.000000 | 0.000000 | 0.083333 | 1 | 0 | 1200.0 | 29.278671 | 8242.687678 | 0.000000 | 12 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8589 | 1283.337407 | 1.0 | 1022.41 | 171.6 | 850.81 | 1158.035224 | 0.875000 | 0.125000 | 0.750000 | 0.250000 | 5 | 16 | 1200.0 | 846.998942 | 1838.410150 | 0.000000 | 8 |
| 8614 | 1182.080141 | 1.0 | 266.68 | 0.0 | 266.68 | 848.848179 | 0.666667 | 0.000000 | 0.666667 | 0.083333 | 6 | 8 | 1000.0 | 119.861878 | 8607.234543 | 0.000000 | 12 |
| 8624 | 1012.089680 | 1.0 | 312.48 | 0.0 | 312.48 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0 | 12 | 1000.0 | 325.606238 | 607.957480 | 0.000000 | 12 |
| 8724 | 3002.791004 | 1.0 | 2463.00 | 2463.0 | 0.00 | 797.158950 | 0.166667 | 0.166667 | 0.000000 | 0.166667 | 3 | 3 | 3000.0 | 431.716492 | 16638.733850 | 0.000000 | 12 |
| 8873 | 1023.883008 | 1.0 | 585.84 | 0.0 | 585.84 | 956.849209 | 1.000000 | 0.000000 | 0.875000 | 0.125000 | 2 | 22 | 1000.0 | 212.105443 | 2789.590897 | 0.000000 | 8 |
227 rows × 17 columns
The "balance" feature is the balance amount the customer has left in their account.
# Anomaly 2
df_wo_id[df_wo_id['minimum_payments'] > df_wo_id['payments']]
| balance | balance_frequency | purchases | oneoff_purchases | installments_purchases | cash_advance | purchases_frequency | oneoff_purchases_frequency | purchases_installments_frequency | cash_advance_frequency | cash_advance_trx | purchases_trx | credit_limit | payments | minimum_payments | prc_full_payment | tenure | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 2495.148862 | 1.000000 | 773.17 | 773.17 | 0.00 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0 | 12 | 7500.0 | 622.066742 | 627.284787 | 0.00 | 12 |
| 5 | 1809.828751 | 1.000000 | 1333.28 | 0.00 | 1333.28 | 0.000000 | 0.666667 | 0.000000 | 0.583333 | 0.000000 | 0 | 8 | 1800.0 | 1400.057770 | 2407.246035 | 0.00 | 12 |
| 10 | 1293.124939 | 1.000000 | 920.12 | 0.00 | 920.12 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0 | 12 | 1200.0 | 1083.301007 | 2172.697765 | 0.00 | 12 |
| 14 | 2772.772734 | 1.000000 | 0.00 | 0.00 | 0.00 | 346.811390 | 0.000000 | 0.000000 | 0.000000 | 0.083333 | 1 | 0 | 3000.0 | 805.647974 | 989.962866 | 0.00 | 12 |
| 15 | 6886.213231 | 1.000000 | 1611.70 | 0.00 | 1611.70 | 2301.491267 | 0.500000 | 0.000000 | 0.500000 | 0.166667 | 4 | 11 | 8000.0 | 1993.439277 | 2109.906490 | 0.00 | 12 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8933 | 735.652303 | 1.000000 | 619.60 | 255.62 | 363.98 | 546.902403 | 1.000000 | 0.166667 | 0.833333 | 0.166667 | 5 | 16 | 1000.0 | 106.138603 | 337.294767 | 0.00 | 6 |
| 8939 | 728.352548 | 1.000000 | 734.40 | 734.40 | 0.00 | 239.891038 | 0.333333 | 0.333333 | 0.000000 | 0.166667 | 2 | 2 | 1000.0 | 72.530037 | 110.950798 | 0.00 | 6 |
| 8947 | 23.398673 | 0.833333 | 144.40 | 0.00 | 144.40 | 0.000000 | 0.833333 | 0.000000 | 0.666667 | 0.000000 | 0 | 5 | 1000.0 | 81.270775 | 82.418369 | 0.25 | 6 |
| 8948 | 13.457564 | 0.833333 | 0.00 | 0.00 | 0.00 | 36.558778 | 0.000000 | 0.000000 | 0.000000 | 0.166667 | 2 | 0 | 500.0 | 52.549959 | 55.755628 | 0.25 | 6 |
| 8949 | 372.708075 | 0.666667 | 1093.25 | 1093.25 | 0.00 | 127.040008 | 0.666667 | 0.666667 | 0.000000 | 0.333333 | 2 | 23 | 1200.0 | 63.165404 | 88.288956 | 0.00 | 6 |
2365 rows × 17 columns
The minimum payment is above the payments due to several things:
# Anomaly 3
df_wo_id[df_wo_id['installments_purchases'] > df_wo_id['credit_limit']]
| balance | balance_frequency | purchases | oneoff_purchases | installments_purchases | cash_advance | purchases_frequency | oneoff_purchases_frequency | purchases_installments_frequency | cash_advance_frequency | cash_advance_trx | purchases_trx | credit_limit | payments | minimum_payments | prc_full_payment | tenure | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 64 | 1923.886805 | 1.000000 | 1887.64 | 0.00 | 1887.64 | 1605.949371 | 1.000000 | 0.000000 | 1.000000 | 0.083333 | 4 | 61 | 1850.0 | 3688.045998 | 3266.996777 | 0.083333 | 12 |
| 180 | 903.297810 | 1.000000 | 2697.48 | 150.50 | 2546.98 | 0.000000 | 1.000000 | 0.333333 | 1.000000 | 0.000000 | 0 | 39 | 2500.0 | 1686.196344 | 303.361632 | 0.000000 | 12 |
| 185 | 168.522709 | 0.818182 | 2009.05 | 0.00 | 2009.05 | 0.000000 | 0.833333 | 0.000000 | 0.750000 | 0.000000 | 0 | 22 | 1200.0 | 2612.760497 | 197.638558 | 0.142857 | 12 |
| 231 | 279.608696 | 1.000000 | 3034.92 | 0.00 | 3034.92 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0 | 12 | 2900.0 | 2928.909865 | 2733.317012 | 0.083333 | 12 |
| 295 | 635.090434 | 0.818182 | 5758.71 | 2252.37 | 3506.34 | 0.000000 | 0.750000 | 0.250000 | 0.750000 | 0.000000 | 0 | 48 | 2900.0 | 7943.570210 | 197.416104 | 1.000000 | 12 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8788 | 133.099445 | 1.000000 | 1170.00 | 0.00 | 1170.00 | 0.000000 | 1.000000 | 0.000000 | 0.900000 | 0.000000 | 0 | 12 | 1000.0 | 1068.502534 | 161.816503 | 0.444444 | 10 |
| 8836 | 112.037368 | 1.000000 | 1006.69 | 0.00 | 1006.69 | 0.000000 | 1.000000 | 0.000000 | 0.888889 | 0.000000 | 0 | 11 | 1000.0 | 1000.036596 | 123.408318 | 1.000000 | 9 |
| 8837 | 272.624436 | 0.888889 | 1843.00 | 470.00 | 1373.00 | 0.000000 | 0.666667 | 0.111111 | 0.444444 | 0.000000 | 0 | 6 | 1000.0 | 1477.357626 | 140.980969 | 0.400000 | 9 |
| 8856 | 227.220411 | 1.000000 | 1387.60 | 288.54 | 1099.06 | 107.660394 | 1.000000 | 0.875000 | 0.875000 | 0.125000 | 1 | 71 | 1000.0 | 1217.473837 | 136.283049 | 0.142857 | 8 |
| 8883 | 931.907808 | 1.000000 | 1142.12 | 0.00 | 1142.12 | 1113.924335 | 1.000000 | 0.000000 | 0.857143 | 0.285714 | 3 | 14 | 1000.0 | 957.421268 | 1064.015704 | 0.000000 | 7 |
118 rows × 17 columns
Installment purchases are above the credit limit due to several things:
# Anomaly 4
df_wo_id[df_wo_id['cash_advance_frequency'] > 1]
| balance | balance_frequency | purchases | oneoff_purchases | installments_purchases | cash_advance | purchases_frequency | oneoff_purchases_frequency | purchases_installments_frequency | cash_advance_frequency | cash_advance_trx | purchases_trx | credit_limit | payments | minimum_payments | prc_full_payment | tenure | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 681 | 5656.069801 | 1.000000 | 362.36 | 362.36 | 0.0 | 7240.433194 | 0.250000 | 0.250000 | 0.0 | 1.250000 | 12 | 2 | 8000.0 | 683.421497 | 2036.877611 | 0.0 | 8 |
| 1626 | 2876.009336 | 1.000000 | 152.61 | 152.61 | 0.0 | 3719.650168 | 0.333333 | 0.333333 | 0.0 | 1.166667 | 24 | 2 | 4000.0 | 248.342971 | 584.926336 | 0.0 | 6 |
| 2555 | 5906.184924 | 1.000000 | 141.80 | 141.80 | 0.0 | 1651.286918 | 0.125000 | 0.125000 | 0.0 | 1.125000 | 12 | 2 | 10000.0 | 933.969974 | 919.289675 | 0.0 | 8 |
| 2608 | 7801.511533 | 1.000000 | 231.40 | 231.40 | 0.0 | 4109.465221 | 0.100000 | 0.100000 | 0.0 | 1.100000 | 20 | 3 | 13500.0 | 1593.617739 | 1522.496755 | 0.0 | 10 |
| 3038 | 3846.742530 | 1.000000 | 0.00 | 0.00 | 0.0 | 1932.460679 | 0.000000 | 0.000000 | 0.0 | 1.500000 | 18 | 0 | 5600.0 | 496.245836 | 538.346874 | 0.0 | 6 |
| 3253 | 5709.486507 | 0.833333 | 0.00 | 0.00 | 0.0 | 2794.326341 | 0.000000 | 0.000000 | 0.0 | 1.166667 | 10 | 0 | 6000.0 | 550.513331 | 1299.463370 | 0.0 | 6 |
| 8055 | 1917.895730 | 1.000000 | 285.07 | 285.07 | 0.0 | 6084.858872 | 0.363636 | 0.363636 | 0.0 | 1.090909 | 28 | 6 | 3000.0 | 5692.682993 | 556.449635 | 0.0 | 11 |
| 8365 | 3857.562230 | 1.000000 | 0.00 | 0.00 | 0.0 | 2127.213754 | 0.000000 | 0.000000 | 0.0 | 1.142857 | 26 | 0 | 5000.0 | 617.508991 | 538.396872 | 0.0 | 7 |
Since fewer than 1% of customers have a cash_advance_frequency above 1, we decided to cap those values at 1, on the assumption that the change will not harm the data distribution.
# change cash_advance_frequency above 1 to 1
cash_advance_freq_more_than_1 = df_wo_id[df_wo_id['cash_advance_frequency'] > 1].index
df_wo_id.loc[cash_advance_freq_more_than_1, 'cash_advance_frequency'] = 1
#check whether there is still cash_advance_frequency above 1
df_wo_id[df_wo_id['cash_advance_frequency'] > 1].cash_advance_frequency.any()
False
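An equivalent one-liner for capping such values is `Series.clip`; a small sketch on toy values:

```python
# Cap values above 1 at 1 without building an index of offending rows.
import pandas as pd

s = pd.Series([0.25, 1.25, 0.5, 1.5], name='cash_advance_frequency')
s_capped = s.clip(upper=1)
print(s_capped.tolist())
```

On the real frame this would be `df_wo_id['cash_advance_frequency'] = df_wo_id['cash_advance_frequency'].clip(upper=1)`.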
#check whether there is any duplicate value
df_wo_id[df_wo_id.duplicated()]
| balance | balance_frequency | purchases | oneoff_purchases | installments_purchases | cash_advance | purchases_frequency | oneoff_purchases_frequency | purchases_installments_frequency | cash_advance_frequency | cash_advance_trx | purchases_trx | credit_limit | payments | minimum_payments | prc_full_payment | tenure |
|---|
# check missing value
df_wo_id.isna().sum()
balance                               0
balance_frequency                     0
purchases                             0
oneoff_purchases                      0
installments_purchases                0
cash_advance                          0
purchases_frequency                   0
oneoff_purchases_frequency            0
purchases_installments_frequency      0
cash_advance_frequency                0
cash_advance_trx                      0
purchases_trx                         0
credit_limit                          1
payments                              0
minimum_payments                    313
prc_full_payment                      0
tenure                                0
dtype: int64
# plot to visualize the missing value proportion
missingno.bar(df_wo_id,color="dodgerblue", sort="ascending", figsize=(10,5), fontsize=12)
<AxesSubplot:>
We found that some features have missing values: "minimum_payments" with 313 and "credit_limit" with 1. We decided to handle them as follows:
# filling missing value with iterative imputer
feat_cols = [col for col in df_wo_id.columns]
itr_imputer = IterativeImputer(initial_strategy='median',
min_value=0, random_state=2022)
df[feat_cols] = itr_imputer.fit_transform(df[feat_cols])
# plot to check the missing value
plt.figure(figsize = (15,10))
sns.heatmap(df.isnull(), cmap='Blues', cbar=False, yticklabels=False, xticklabels=df.columns)
<AxesSubplot:>
# checking the missing value
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 8950 entries, 0 to 8949 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 cust_id 8950 non-null object 1 balance 8950 non-null float64 2 balance_frequency 8950 non-null float64 3 purchases 8950 non-null float64 4 oneoff_purchases 8950 non-null float64 5 installments_purchases 8950 non-null float64 6 cash_advance 8950 non-null float64 7 purchases_frequency 8950 non-null float64 8 oneoff_purchases_frequency 8950 non-null float64 9 purchases_installments_frequency 8950 non-null float64 10 cash_advance_frequency 8950 non-null float64 11 cash_advance_trx 8950 non-null float64 12 purchases_trx 8950 non-null float64 13 credit_limit 8950 non-null float64 14 payments 8950 non-null float64 15 minimum_payments 8950 non-null float64 16 prc_full_payment 8950 non-null float64 17 tenure 8950 non-null float64 dtypes: float64(17), object(1) memory usage: 1.2+ MB
# creating new variable for modeling
df_model = df.copy()
Since our goal in this project is clustering (customer segmentation) using only behaviour-related features, we decided to drop the features that describe amounts rather than behaviour.
Reference : https://medium.com/moderncompany/using-behavioral-segmentation-to-understand-your-customers-87f3553f3cf8
# droppping unused feature
df_model.drop(['cust_id', 'balance', 'purchases',
'cash_advance', 'cash_advance_trx',
'purchases_trx', 'credit_limit',
'payments', 'minimum_payments', 'tenure'],
axis=1, inplace=True, errors='ignore')
df_model.columns
Index(['balance_frequency', 'oneoff_purchases', 'installments_purchases',
'purchases_frequency', 'oneoff_purchases_frequency',
'purchases_installments_frequency', 'cash_advance_frequency',
'prc_full_payment'],
dtype='object')
Since the raw amounts carry no behavioural information by themselves, we also convert oneoff_purchases to oneoff_proportion and installments_purchases to installments_proportion (each as a share of total purchases) to better capture purchase behaviour.
# change oneoff_purchase to oneoff_proportion
oneoff_proportion = df_wo_id['oneoff_purchases'] / df_wo_id['purchases']
# change installments_purchase to installments_proportion
installments_proportion = df_wo_id['installments_purchases'] / df_wo_id['purchases']
# rename columns
df_model.rename(columns={'oneoff_purchases_frequency': 'oneoff_frequency',
'purchases_installments_frequency': 'installments_frequency',
'prc_full_payment': 'payments_proportion'},
inplace=True, errors='ignore')
df_model.columns
Index(['balance_frequency', 'oneoff_purchases', 'installments_purchases',
'purchases_frequency', 'oneoff_frequency', 'installments_frequency',
'cash_advance_frequency', 'payments_proportion'],
dtype='object')
# check changed feature
df_model.head().T
| 0 | 1 | 2 | 3 | 4 | |
|---|---|---|---|---|---|
| balance_frequency | 0.818182 | 0.909091 | 1.00 | 0.636364 | 1.000000 |
| oneoff_purchases | 0.000000 | 0.000000 | 773.17 | 1499.000000 | 16.000000 |
| installments_purchases | 95.400000 | 0.000000 | 0.00 | 0.000000 | 0.000000 |
| purchases_frequency | 0.166667 | 0.000000 | 1.00 | 0.083333 | 0.083333 |
| oneoff_frequency | 0.000000 | 0.000000 | 1.00 | 0.083333 | 0.083333 |
| installments_frequency | 0.083333 | 0.000000 | 0.00 | 0.000000 | 0.000000 |
| cash_advance_frequency | 0.000000 | 0.250000 | 0.00 | 0.083333 | 0.000000 |
| payments_proportion | 0.000000 | 0.222222 | 0.00 | 0.000000 | 0.000000 |
# creating function for detailed descriptive analytic
def summary_stats(df_model, n=4):
# central tendency: mean, median
mean = pd.DataFrame(df_model.apply(np.mean)).T
median = pd.DataFrame(df_model.apply(np.median)).T
# distribution: ,std, min, max, range, skew, kurtosis
std = pd.DataFrame(df_model.apply(np.std)).T
min_value = pd.DataFrame(df_model.apply(min)).T
max_value = pd.DataFrame(df_model.apply(max)).T
range_value = pd.DataFrame(df_model.apply(lambda x: x.max() - x.min())).T
skewness = pd.DataFrame(df_model.apply(lambda x: x.skew())).T
kurtosis = pd.DataFrame(df_model.apply(lambda x: x.kurtosis())).T
# concatenates
summary_stats = pd.concat([min_value, max_value, range_value, mean, median, std, skewness, kurtosis]).T.reset_index()
summary_stats.columns = ['attributes','min','max', 'range','mean','median', 'std','skewness','kurtosis']
return round(summary_stats, n)
# descriptive analytic for modeling feature
summary_stats(df_model)
| attributes | min | max | range | mean | median | std | skewness | kurtosis | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | balance_frequency | 0.0 | 1.00 | 1.00 | 0.8773 | 1.0000 | 0.2369 | -2.0233 | 3.0924 |
| 1 | oneoff_purchases | 0.0 | 40761.25 | 40761.25 | 592.4374 | 38.0000 | 1659.7952 | 10.0451 | 164.1876 |
| 2 | installments_purchases | 0.0 | 22500.00 | 22500.00 | 411.0676 | 89.0000 | 904.2876 | 7.2991 | 96.5752 |
| 3 | purchases_frequency | 0.0 | 1.00 | 1.00 | 0.4904 | 0.5000 | 0.4013 | 0.0602 | -1.6386 |
| 4 | oneoff_frequency | 0.0 | 1.00 | 1.00 | 0.2025 | 0.0833 | 0.2983 | 1.5356 | 1.1618 |
| 5 | installments_frequency | 0.0 | 1.00 | 1.00 | 0.3644 | 0.1667 | 0.3974 | 0.5092 | -1.3986 |
| 6 | cash_advance_frequency | 0.0 | 1.50 | 1.50 | 0.1351 | 0.0000 | 0.2001 | 1.8287 | 3.3347 |
| 7 | payments_proportion | 0.0 | 1.00 | 1.00 | 0.1537 | 0.0000 | 0.2925 | 1.9428 | 2.4324 |
kurtosis: the heaviness of a distribution's tails relative to a normal distribution; larger values mean heavier tails and a sharper peak.
skewness: the asymmetry of a distribution around its mean.
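To make these two notions concrete, a small sketch on toy data (not the project's dataframe) using the same pandas estimators as `summary_stats`:

```python
import pandas as pd

# A symmetric sample has zero skewness; a long right tail makes it positive.
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 1, 2, 10])

print(symmetric.skew())     # 0.0 -- perfectly symmetric
print(right_skewed.skew())  # positive -- tail stretches to the right

# pandas reports excess kurtosis: a flat, short-tailed sample is negative.
print(symmetric.kurtosis())
```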
Check New Feature¶
# Knowing proportion anomalies (nan, inf, more_than_1)
def proportion_anomalies():
nan_oneoff_proportion = np.isnan(oneoff_proportion).sum()
nan_installments_proportion = np.isnan(installments_proportion).sum()
inf_oneoff_proportion = np.isinf(oneoff_proportion).sum()
inf_installments_proportion = np.isinf(installments_proportion).sum()
more_than_1_oneoff_proportion = len(oneoff_proportion[oneoff_proportion > 1])
more_than_1_installments_proportion = len(installments_proportion[installments_proportion > 1])
proportion_anomalies = pd.DataFrame({'nan': [nan_oneoff_proportion, nan_installments_proportion],
'inf': [inf_oneoff_proportion, inf_installments_proportion],
'more_than_1': [more_than_1_oneoff_proportion, more_than_1_installments_proportion]},
index=['oneoff_proportion', 'installments_proportion'])
return proportion_anomalies
proportion_anomalies()
| nan | inf | more_than_1 | |
|---|---|---|---|
| oneoff_proportion | 2044 | 0 | 3 |
| installments_proportion | 2042 | 2 | 14 |
df_wo_id[np.isnan(installments_proportion)].loc[:, ['purchases', 'installments_purchases']].value_counts()
purchases installments_purchases 0.0 0.0 2042 dtype: int64
df_wo_id[np.isnan(oneoff_proportion)].iloc[:, 2:4].value_counts()
purchases oneoff_purchases 0.0 0.0 2044 dtype: int64
We noted 2,044 NaN values and 3 values greater than 1 in oneoff_proportion, and 2,042 NaN values, 2 infinite values, and 14 values greater than 1 in installments_proportion. Since these are invalid proportion values, we transform them into reasonable ones.
# Fill NaN value with zero value
oneoff_proportion.fillna(0, inplace=True)
installments_proportion.fillna(0, inplace=True)
# check anomalies
proportion_anomalies()
| nan | inf | more_than_1 | |
|---|---|---|---|
| oneoff_proportion | 0 | 0 | 3 |
| installments_proportion | 0 | 2 | 14 |
df_wo_id.iloc[installments_proportion[installments_proportion == np.inf].index].iloc[:,2:5]
| purchases | oneoff_purchases | installments_purchases | |
|---|---|---|---|
| 4682 | 0.0 | 0.0 | 20.00 |
| 5737 | 0.0 | 0.0 | 66.95 |
It turns out the inf values come from rows where installments_purchases > 0 is divided by purchases = 0.
We will change these to 1, on the assumption that since oneoff is 0 and installments is > 0, 100% of the purchases came from installments.
# change inf to 1
# inf results from x/0 where x > 0
inf_to_1 = installments_proportion[installments_proportion == np.inf].index
installments_proportion.loc[inf_to_1] = 1
proportion_anomalies()
| nan | inf | more_than_1 | |
|---|---|---|---|
| oneoff_proportion | 0 | 0 | 3 |
| installments_proportion | 0 | 0 | 12 |
For values greater than 1, we overwrite them directly with 1, assuming that because they are so few, the change will not distort the existing distribution.
# change values > 1 to 1
# only a few rows, so the distribution will not be distorted
oneoff_more_than_1 = oneoff_proportion[oneoff_proportion > 1].index
oneoff_proportion.loc[oneoff_more_than_1] = 1
installments_more_than_1 = installments_proportion[installments_proportion > 1].index
installments_proportion.loc[installments_more_than_1] = 1
proportion_anomalies()
| nan | inf | more_than_1 | |
|---|---|---|---|
| oneoff_proportion | 0 | 0 | 0 |
| installments_proportion | 0 | 0 | 0 |
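In hindsight, the three fixes above (NaN from 0/0, inf from x/0, and ratios above 1) can be applied in one chained expression; a minimal sketch on toy values, not the project's dataframe:

```python
import numpy as np
import pandas as pd

purchases = pd.Series([100.0, 0.0, 0.0, 50.0])
oneoff = pd.Series([40.0, 0.0, 20.0, 80.0])

# pandas division yields NaN for 0/0 and inf for x/0 (x > 0);
# chaining replace/fillna/clip applies all three fixes at once.
prop = (oneoff.div(purchases)
              .replace([np.inf, -np.inf], 1)  # x/0 -> 100% of purchases
              .fillna(0)                      # 0/0 -> no purchases at all
              .clip(upper=1))                 # cap ratios above 1
print(prop.tolist())  # [0.4, 0.0, 1.0, 1.0]
```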
# change the oneoff_purchase with oneoff_proportion and installment_purchase with installment_proportion
df_model.oneoff_purchases = oneoff_proportion
df_model.installments_purchases = installments_proportion
df_model.rename(columns={'oneoff_purchases': 'oneoff_proportion',
'installments_purchases': 'installments_proportion'},
inplace=True)
df_model.head()
| balance_frequency | oneoff_proportion | installments_proportion | purchases_frequency | oneoff_frequency | installments_frequency | cash_advance_frequency | payments_proportion | |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.818182 | 0.0 | 1.0 | 0.166667 | 0.000000 | 0.083333 | 0.000000 | 0.000000 |
| 1 | 0.909091 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.250000 | 0.222222 |
| 2 | 1.000000 | 1.0 | 0.0 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
| 3 | 0.636364 | 1.0 | 0.0 | 0.083333 | 0.083333 | 0.000000 | 0.083333 | 0.000000 |
| 4 | 1.000000 | 1.0 | 0.0 | 0.083333 | 0.083333 | 0.000000 | 0.000000 | 0.000000 |
# descriptive analytic for modeling feature
summary_stats(df_model)
| attributes | min | max | range | mean | median | std | skewness | kurtosis | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | balance_frequency | 0.0 | 1.0 | 1.0 | 0.8773 | 1.0000 | 0.2369 | -2.0233 | 3.0924 |
| 1 | oneoff_proportion | 0.0 | 1.0 | 1.0 | 0.3795 | 0.1114 | 0.4256 | 0.4578 | -1.5581 |
| 2 | installments_proportion | 0.0 | 1.0 | 1.0 | 0.3923 | 0.1851 | 0.4290 | 0.4509 | -1.5629 |
| 3 | purchases_frequency | 0.0 | 1.0 | 1.0 | 0.4904 | 0.5000 | 0.4013 | 0.0602 | -1.6386 |
| 4 | oneoff_frequency | 0.0 | 1.0 | 1.0 | 0.2025 | 0.0833 | 0.2983 | 1.5356 | 1.1618 |
| 5 | installments_frequency | 0.0 | 1.0 | 1.0 | 0.3644 | 0.1667 | 0.3974 | 0.5092 | -1.3986 |
| 6 | cash_advance_frequency | 0.0 | 1.5 | 1.5 | 0.1351 | 0.0000 | 0.2001 | 1.8287 | 3.3347 |
| 7 | payments_proportion | 0.0 | 1.0 | 1.0 | 0.1537 | 0.0000 | 0.2925 | 1.9428 | 2.4324 |
For modeling, we apply PCA before clustering, since PCA reduces the number of features, which eases interpretation and simplifies complex patterns in the data.
We then cluster the reduced features with several models, Agglomerative Clustering (Ward, Average, and Complete linkage), K-Means, and K-Medoids, and compare the results by their metric scores and number of clusters.
For Metrics, we will use :
Silhouette Score :
The technique provides a succinct graphical representation of how well each object has been classified. The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters. source : https://en.wikipedia.org/wiki/Silhouette_(clustering)
Davies Bouldin Score :
The Davies–Bouldin index is the average similarity between each cluster and its most similar one, where similarity is defined as a ratio of within-cluster scatter to between-cluster separation. Because it rewards compact, well-separated clusters, a lower value means better clustering; the closer to 0, the better. source : https://en.wikipedia.org/wiki/Davies%E2%80%93Bouldin_index
Calinski Harabasz Score :
The Calinski-Harabasz index, also known as the Variance Ratio Criterion, is the ratio of between-cluster dispersion to within-cluster dispersion over all clusters. Unlike the other metrics we have seen, this score is unbounded; the higher the score, the better-defined the clusters. source : https://medium.com/@haataa/how-to-measure-clustering-performances-when-there-are-no-ground-truth-db027e9a871c
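To illustrate how the three metrics behave together, a small sketch on synthetic data (well-separated blobs, not the project's dataframe), using the same sklearn scorers we use below:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

# Two well-separated synthetic blobs and a quick K-Means labeling.
X, _ = make_blobs(n_samples=200, centers=[[0, 0], [6, 6]],
                  cluster_std=0.5, random_state=2022)
labels = KMeans(n_clusters=2, n_init=10, random_state=2022).fit_predict(X)

# For clean separation: silhouette near 1, Davies-Bouldin near 0,
# Calinski-Harabasz very large.
sil = silhouette_score(X, labels)
db = davies_bouldin_score(X, labels)
ch = calinski_harabasz_score(X, labels)
print(sil, db, ch)
```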
# Model Init
number_of_cluster = np.arange(2,11) # Number of Clusters
np.random.seed(2022) # Set global randomseed for sklearn models
#running PCA with full components
pca_all = PCA(n_components=len(df_model.columns), random_state = 2022)
pca_all.fit(df_model)
variance = pca_all.explained_variance_ratio_
var = np.cumsum(variance)*100
#plot for information extracted
fig = px.line(x=np.arange(len(df_model.columns))+1, y=var, markers=True)
fig.update_xaxes(title_text='# of Features')
fig.update_yaxes(title_text='% Variance Explained')
fig.update_layout(width=900)
#total information extracted after pca
print('information extracted {} %'.format(np.sum(variance[:2])*100/np.sum(variance)))
information extracted 74.33796026869172 %
# Reducing the dimensions of the data
pca_n = PCA(n_components = 2, random_state = 2022)
X_principal = pca_n.fit_transform(df_model)
X_principal = pd.DataFrame(X_principal)
X_principal.columns = ['P1', 'P2']
summary_stats(X_principal)
| attributes | min | max | range | mean | median | std | skewness | kurtosis | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | P1 | -0.9035 | 1.2076 | 2.111 | 0.0 | -0.0803 | 0.6634 | 0.2954 | -1.424 |
| 1 | P2 | -0.8021 | 1.2440 | 2.046 | -0.0 | -0.1210 | 0.5193 | 0.4634 | -0.921 |
#plot for dimensions after pca
fig = go.Figure(go.Scatter(
x=X_principal['P1'], y=X_principal['P2'], mode='markers'))
fig.update_xaxes(title_text='P1')
fig.update_yaxes(title_text='P2')
fig.update_layout(height=700, width=700,
title_text='Principal Component Analysis')
Agglomerative Clustering is a hierarchical clustering strategy (clustering techniques that form a hierarchy, resembling a tree structure): it starts with each object in its own cluster and then merges clusters into increasingly large ones. We use 3 linkage methods in Agglomerative Clustering:
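The bottom-up merging can be seen on a tiny example; a sketch with scipy's hierarchy tools on toy points (not the project's data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 1-D points forming two obvious groups: {0, 1} and {10, 11}.
pts = np.array([[0.0], [1.0], [10.0], [11.0]])

# Ward linkage merges the closest clusters first, building the tree bottom-up.
Z = linkage(pts, method='ward')

# Cutting the tree at 2 clusters recovers the two groups.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```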
# create accumulators for metric scores
ward_s_scores = []
ward_db_scores = []
ward_calinski_scores = []
average_s_scores = []
average_db_scores = []
average_calinski_scores = []
complete_s_scores = []
complete_db_scores = []
complete_calinski_scores = []
agglo_metrics = {'ward': [ward_s_scores, ward_db_scores, ward_calinski_scores, 'rgb(128,177,211)'],
'average': [average_s_scores, average_db_scores, average_calinski_scores, 'rgb(141,211,199)'],
'complete': [complete_s_scores, complete_db_scores, complete_calinski_scores, 'rgb(251,128,114)']}
for i, j in product(agglo_metrics, number_of_cluster):
agglo = AgglomerativeClustering(linkage=i, n_clusters=j)
agglo.fit(X_principal)
agglo_metrics[i][0].append(silhouette_score(
X_principal, agglo.labels_))
agglo_metrics[i][1].append(davies_bouldin_score(
X_principal, agglo.labels_))
agglo_metrics[i][2].append(calinski_harabasz_score(
X_principal, agglo.labels_))
#metric scores for each linkage method
fig = make_subplots(rows=3, cols=1, shared_xaxes=True, vertical_spacing=.01)
for i in agglo_metrics:
for j, k in zip(range(1, 4), ['Silhouette', 'Davies Bouldin', 'Calinski-Harabasz']):
fig.append_trace(go.Scatter(x=list(number_of_cluster), y=agglo_metrics[i][j-1], name=i.title(),
legendgroup=i, line_color=agglo_metrics[i][-1],
showlegend=False if j != 1 else True), row=j, col=1)
fig.update_yaxes(title_text=k, row=j, col=1)
fig.update_xaxes(title_text='Number of Clusters', row=3)
fig.update_layout(height=800, width=900,
legend_title_text='Linkage',
title_text='Agglomerative Clustering Metric Scores')
#comparing result
compare_agg = pd.DataFrame({'Method' : ['Ward', 'Average', 'Complete'],
'n Cluster' : ['7', '4', '7'],
'Silhouette Score' : [ward_s_scores[5], average_s_scores[2], complete_s_scores[5]],
'Davies Score' : [ward_db_scores[5], average_db_scores[2], complete_db_scores[5]],
'Calinski Score' : [ward_calinski_scores[5], average_calinski_scores[2], complete_calinski_scores[5]]})
compare_agg
| Method | n Cluster | Silhouette Score | Davies Score | Calinski Score | |
|---|---|---|---|---|---|
| 0 | Ward | 7 | 0.602435 | 0.713352 | 18002.677338 |
| 1 | Average | 4 | 0.547800 | 0.748187 | 12285.553170 |
| 2 | Complete | 7 | 0.405958 | 0.860245 | 7815.775791 |
#creating new dataframe for agglomerative clustering
df_dend = X_principal.copy()
#adding new columns for ward
agg_ward = AgglomerativeClustering(n_clusters=7, linkage='ward')
df_dend['ward'] = agg_ward.fit_predict(X_principal)
#adding new columns for average
agg_average = AgglomerativeClustering(n_clusters=4, linkage='average')
df_dend['average'] = agg_average.fit_predict(X_principal)
#adding new columns for complete
agg_complete = AgglomerativeClustering(n_clusters=7, linkage='complete')
df_dend['complete'] = agg_complete.fit_predict(X_principal)
#showing dataframe
df_dend
| P1 | P2 | ward | average | complete | |
|---|---|---|---|---|---|
| 0 | 0.120873 | -0.686723 | 2 | 0 | 1 |
| 1 | -0.574777 | -0.531535 | 3 | 2 | 0 |
| 2 | -0.340687 | 1.088999 | 0 | 3 | 5 |
| 3 | -0.827379 | 0.203195 | 4 | 3 | 0 |
| 4 | -0.805437 | 0.244772 | 4 | 3 | 0 |
| ... | ... | ... | ... | ... | ... |
| 8945 | 1.037319 | -0.172448 | 5 | 0 | 6 |
| 8946 | 0.960066 | -0.214203 | 5 | 0 | 6 |
| 8947 | 0.816581 | -0.303066 | 5 | 0 | 6 |
| 8948 | -0.565310 | -0.530952 | 3 | 2 | 0 |
| 8949 | -0.554672 | 0.726226 | 0 | 3 | 2 |
8950 rows × 5 columns
#creating visualization to compare the result from the best score for each agglomerative clustering method
num, hue = 0, ['ward', 'average', 'complete']
plt.figure(figsize=(30, 9))
for i in hue:
num += 1
plt.subplot(1, 3, num)
sns.scatterplot(x='P1', y='P2', hue=i, data=df_dend)
plt.title(i.title())
While the Ward method has the highest scores of the three, its large number of clusters (7) could make per-cluster analysis confusing. We therefore choose the Average method as the representative agglomerative model, since it uses 4 clusters and its scores are still good.
K-Means¶
K-Means Clustering is a partitioning method that separates data into different clusters. Through iterative partitioning, it minimizes the average distance of each data point to its cluster centroid (MacQueen, 1967).
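The iteration alternates two steps, assign each point to its nearest centroid, then move each centroid to the mean of its points; a minimal sketch of this loop on toy data (not the project's dataframe):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Toy data: two clear groups on a line.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
centroids = np.array([[0.0], [10.0]])  # initial guesses

for _ in range(5):
    # Assignment step: each point joins its nearest centroid.
    labels = np.argmin(cdist(X, centroids), axis=1)
    # Update step: each centroid moves to the mean of its points,
    # which lowers the average within-cluster distance.
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])

print(centroids.ravel())  # converges to the group means 0.5 and 10.5
```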
#creating list for scoring
avg_withinSS = []
#create loop to run the algorithm
for i in number_of_cluster:
kmeans = KMeans(n_clusters=i, random_state = 2022)
kmeans.fit(X_principal)
centroids = kmeans.cluster_centers_
D_k = cdist(X_principal, centroids, 'euclidean')
cIdx = np.argmin(D_k, axis=1)
dist = np.min(D_k, axis=1)
avg_withinSS.append(sum(dist)/X_principal.shape[0])
#creating plot for elbow method visualization
fig = px.line(x=number_of_cluster, y=avg_withinSS, markers=True)
fig.update_xaxes(title_text='Number of Clusters')
fig.update_yaxes(title_text='Average Within Sum of Square')
fig.update_layout(width=900)
# create accumulators for metric scores
kmeans_s_scores = []
kmeans_db_scores = []
kmeans_calinski_scores = []
#looping for modeling
for i in number_of_cluster:
kmeans = KMeans(n_clusters=i, random_state = 2022)
kmeans.fit(X_principal)
kmeans_labels = kmeans.predict(X_principal)
kmeans_s_scores.append(silhouette_score(
X_principal, kmeans_labels, metric='euclidean'))
kmeans_db_scores.append(davies_bouldin_score(X_principal, kmeans_labels))
kmeans_calinski_scores.append(
calinski_harabasz_score(X_principal, kmeans_labels))
#creating plot for scoring visualization
fig = make_subplots(rows=3, cols=1, shared_xaxes=True, vertical_spacing=.01)
kmeans_score = {'Silhouette': kmeans_s_scores,
'Davies Bouldin': kmeans_db_scores,
'Calinski-Harabasz': kmeans_calinski_scores}
for i, score in zip(np.arange(1, 4), kmeans_score):
fig.append_trace(go.Scatter(
x=list(number_of_cluster), y=kmeans_score[score], name=score), row=i, col=1)
fig.update_yaxes(title_text=score, row=i, col=1)
fig.update_xaxes(title_text='Number of Clusters', row=3)
fig.update_layout(height=800, width=900,
showlegend=False,
title_text='Metric Scores')
#creating list for metrics
kmeans_std_s_scores = []
kmeans_std_db_scores = []
kmeans_std_calinski_scores = []
#parameter for checking the cluster stability
n_sets = 8
sets = np.array_split(X_principal, n_sets)
# calculates the scores and store in their respective list
for element in sets:
kmeans = KMeans(n_clusters=4, random_state = 2022)
kmeans.fit(element)
kmeans_std_s_scores.append(silhouette_score(
element, kmeans.labels_, metric='euclidean'))
kmeans_std_db_scores.append(davies_bouldin_score(element, kmeans.labels_))
kmeans_std_calinski_scores.append(
calinski_harabasz_score(element, kmeans.labels_))
#check the cluster stability
print(f'Standard deviation KMeans with 4 clusters:\nSilhouette: {np.std(kmeans_std_s_scores)}\nDavies Bouldin: {np.std(kmeans_std_db_scores)}\nCalinski Harabasz: {np.std(kmeans_std_calinski_scores)}\n')
Standard deviation KMeans with 4 clusters: Silhouette: 0.015894377182182978 Davies Bouldin: 0.014988175775226655 Calinski Harabasz: 256.53120335065455
#choosing 4 cluster as the best number of cluster
kmeans = KMeans(n_clusters=4, random_state = 2022)
kmeans.fit(X_principal)
KMeans(n_clusters=4, random_state=2022)
#plot to visualize the result
fig = px.scatter(X_principal, x='P1', y='P2',
color=list(map(lambda x: str(x), kmeans.labels_)))
fig.add_trace(go.Scatter(x=kmeans.cluster_centers_[:, 0], y=kmeans.cluster_centers_[:, 1], mode='markers',
marker_color='rgba(152, 0, 0, .8)', marker_size=15, name='Centroids'))
fig.update_xaxes(title_text='P1')
fig.update_yaxes(title_text='P2')
fig.update_layout(height=700, width=800,
legend_title_text='Clusters',
title_text='K-Means with 4 Clusters')
K-Medoids¶
K-Medoids Clustering, also known as Partitioning Around Medoids (PAM), is a variant of the K-Means method. It represents each cluster by a medoid, an actual observation, instead of the mean of its observations, with the aim of reducing the sensitivity of the resulting partition to extreme values in the dataset (Vercellis, 2009).
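The robustness claim is easy to see in miniature: a medoid is the cluster member with the smallest total distance to all other members, so one extreme value barely moves it, while the mean is dragged toward it. A sketch on toy values (not the project's data):

```python
import numpy as np
from scipy.spatial.distance import cdist

# One cluster with a single extreme value.
cluster = np.array([[0.0], [1.0], [2.0], [100.0]])

# The medoid is the actual point with the smallest total distance to the rest.
total_dist = cdist(cluster, cluster, 'euclidean').sum(axis=1)
medoid = cluster[np.argmin(total_dist)]

print(medoid)          # [1.] -- barely moved by the outlier
print(cluster.mean())  # 25.75 -- the mean is dragged toward it
```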
We use 3 distance metrics: Manhattan (cityblock), Euclidean, and Cosine.
#creating list for metrics
manhattan_avg_withinSS = []
euclidean_avg_withinSS = []
cosine_avg_withinSS = []
#creating list for metrics
kmedoids_avg_withinSS = {'cityblock': manhattan_avg_withinSS,
'euclidean': euclidean_avg_withinSS,
'cosine': cosine_avg_withinSS}
#looping for running k-medoids
for i, j in product(kmedoids_avg_withinSS, number_of_cluster):
kmedoids = KMedoids(metric=i, n_clusters=j, random_state = 2022)
kmedoids.fit(X_principal)
D_k = cdist(X_principal, kmedoids.cluster_centers_, i)
cIdx = np.argmin(D_k, axis=1)
dist = np.min(D_k, axis=1)
kmedoids_avg_withinSS[i].append(sum(dist)/X_principal.shape[0])
#creating plot to visualize elbow method scoring
fig = go.Figure()
for i in kmedoids_avg_withinSS:
fig.add_trace(go.Scatter(x=list(number_of_cluster),
y=kmedoids_avg_withinSS[i],
name=i if i != 'cityblock' else 'manhattan'))
fig.update_xaxes(title_text='Number of Clusters')
fig.update_yaxes(title_text='Average Within Sum of Square')
fig.update_layout(width=900, legend_title_text='Metrics')
# create accumulators for metric scores
manhattan_s_scores = []
manhattan_db_scores = []
manhattan_calinski_scores = []
euclidean_s_scores = []
euclidean_db_scores = []
euclidean_calinski_scores = []
cosine_s_scores = []
cosine_db_scores = []
cosine_calinski_scores = []
kmedoids_metrics = {'manhattan': [manhattan_s_scores, manhattan_db_scores, manhattan_calinski_scores, 'rgb(128,177,211)'],
'euclidean': [euclidean_s_scores, euclidean_db_scores, euclidean_calinski_scores, 'rgb(141,211,199)'],
'cosine': [cosine_s_scores, cosine_db_scores, cosine_calinski_scores, 'rgb(251,128,114)']}
#looping to run k-medoids by each distance metrics
for i, j in product(kmedoids_metrics, number_of_cluster):
kmedoids = KMedoids(metric=i, n_clusters=j, random_state = 2022)
kmedoids.fit(X_principal)
kmedoids_metrics[i][0].append(silhouette_score(
X_principal, kmedoids.labels_, metric=i))
kmedoids_metrics[i][1].append(davies_bouldin_score(
X_principal, kmedoids.labels_))
kmedoids_metrics[i][2].append(calinski_harabasz_score(
X_principal, kmedoids.labels_))
#creating plot to compare the result
fig = make_subplots(rows=3, cols=1, shared_xaxes=True, vertical_spacing=.01)
for i in kmedoids_metrics:
for j, k in zip(range(1, 4), ['Silhouette', 'Davies Bouldin', 'Calinski-Harabasz']):
fig.append_trace(go.Scatter(x=list(number_of_cluster), y=kmedoids_metrics[i][j-1], name=i.title(),
legendgroup=i, line_color=kmedoids_metrics[i][-1],
showlegend=False if j != 1 else True), row=j, col=1)
fig.update_yaxes(title_text=k, row=j, col=1)
fig.update_xaxes(title_text='Number of Clusters', row=3)
fig.update_layout(height=800, width=900,
legend_title_text='Metrics',
title_text='Metric Scores',
legend_traceorder='reversed')
Based on score stability, we decided that 4 is the best number of clusters for each distance metric.
#using 4 as number of clusters
kmedoids_manhattan = KMedoids(metric="manhattan", n_clusters=4, random_state = 2022)
kmedoids_euclidean = KMedoids(metric="euclidean", n_clusters=4, random_state = 2022)
kmedoids_cosine = KMedoids(metric="cosine", n_clusters=4, random_state = 2022)
#running the algorithm
kmedoids_manhattan.fit(X_principal)
kmedoids_euclidean.fit(X_principal)
kmedoids_cosine.fit(X_principal);
#creating plot to compare the result by each distance parameter
kmedoids_graph = {'Manhattan': kmedoids_manhattan,
'Euclidean': kmedoids_euclidean,
'Cosine': kmedoids_cosine}
plt.figure(figsize=(24, 8), constrained_layout=True)
for i, j in zip(kmedoids_graph, range(1,4)):
plt.subplot(1,3,j)
sns.scatterplot(x='P1', y='P2', data=X_principal,
hue= kmedoids_graph[i].labels_)
sns.scatterplot(data = None,
x = kmedoids_graph[i].cluster_centers_[:,0],
y = kmedoids_graph[i].cluster_centers_[:,1])
plt.title(i)
Compare Model¶
We will compare the best clustering from each model:
df_compare = pd.DataFrame({'Model' : ['Agglomerative (Average)', 'K-Means', 'K-Medoids (Cosine)'],
'N Cluster' : [4, 4, 4],
'Silhouette Score' : [average_s_scores[2], kmeans_s_scores[2], cosine_s_scores[2]],
'Davies Bouldin Score' : [average_db_scores[2], kmeans_db_scores[2], cosine_db_scores[2]],
'Calinski Harabasz Score' : [average_calinski_scores[2], kmeans_calinski_scores[2], cosine_calinski_scores[2]]
})
df_compare
| Model | N Cluster | Silhouette Score | Davies Bouldin Score | Calinski Harabasz Score | |
|---|---|---|---|---|---|
| 0 | Agglomerative (Average) | 4 | 0.547800 | 0.748187 | 12285.553170 |
| 1 | K-Means | 4 | 0.594159 | 0.605257 | 15797.196585 |
| 2 | K-Medoids (Cosine) | 4 | 0.793798 | 0.605751 | 15759.281845 |
The best model is K-Medoids with Cosine distance, so we choose it to cluster the dataset:
K-Medoids, distance metric = "cosine", n clusters = 4
#choosing best model for clustering
model_fix = KMedoids(metric="cosine", n_clusters=4, random_state = 2022)
model_fix.fit(X_principal)
model_fix_centers = model_fix.cluster_centers_
model_fix_labels = model_fix.predict(X_principal)
model_fix_centers
array([[ 0.15277368, 0.44430275],
[-0.54439581, -0.56792145],
[-0.73581394, 0.40044335],
[ 0.81658088, -0.30306642]])
#showing cluster
np.unique(model_fix_labels)
array([0, 1, 2, 3], dtype=int64)
#add the cluster to dataframe
df_cluster = pd.concat([df, pd.DataFrame({'cluster' :model_fix_labels})], axis = 1)
df_cluster.head()
| cust_id | balance | balance_frequency | purchases | oneoff_purchases | installments_purchases | cash_advance | purchases_frequency | oneoff_purchases_frequency | purchases_installments_frequency | cash_advance_frequency | cash_advance_trx | purchases_trx | credit_limit | payments | minimum_payments | prc_full_payment | tenure | cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C10001 | 40.900749 | 0.818182 | 95.40 | 0.00 | 95.4 | 0.000000 | 0.166667 | 0.000000 | 0.083333 | 0.000000 | 0.0 | 2.0 | 1000.0 | 201.802084 | 139.509787 | 0.000000 | 12.0 | 1 |
| 1 | C10002 | 3202.467416 | 0.909091 | 0.00 | 0.00 | 0.0 | 6442.945483 | 0.000000 | 0.000000 | 0.000000 | 0.250000 | 4.0 | 0.0 | 7000.0 | 4103.032597 | 1072.340217 | 0.222222 | 12.0 | 1 |
| 2 | C10003 | 2495.148862 | 1.000000 | 773.17 | 773.17 | 0.0 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.0 | 12.0 | 7500.0 | 622.066742 | 627.284787 | 0.000000 | 12.0 | 0 |
| 3 | C10004 | 1666.670542 | 0.636364 | 1499.00 | 1499.00 | 0.0 | 205.788017 | 0.083333 | 0.083333 | 0.000000 | 0.083333 | 1.0 | 1.0 | 7500.0 | 0.000000 | 431.220633 | 0.000000 | 12.0 | 2 |
| 4 | C10005 | 817.714335 | 1.000000 | 16.00 | 16.00 | 0.0 | 0.000000 | 0.083333 | 0.083333 | 0.000000 | 0.000000 | 0.0 | 1.0 | 1200.0 | 678.334763 | 244.791237 | 0.000000 | 12.0 | 2 |
#cluster population
df_cluster['cluster'].value_counts()
3 2939 1 2287 2 2036 0 1688 Name: cluster, dtype: int64
#running PCA
pca_n = PCA(n_components = 2, random_state = 2022)
X_principal = pca_n.fit_transform(df_model)
df_X_principal = pd.DataFrame(X_principal, columns = ['Principal Component 1', 'Principal Component 2'])
df_X_principal.head(2)
| Principal Component 1 | Principal Component 2 | |
|---|---|---|
| 0 | 0.120873 | -0.686723 |
| 1 | -0.574777 | -0.531535 |
#combine pca and cluster
finalDf = pd.concat([df_X_principal, pd.DataFrame({'cluster':model_fix_labels})], axis = 1)
finalDf.head()
| Principal Component 1 | Principal Component 2 | cluster | |
|---|---|---|---|
| 0 | 0.120873 | -0.686723 | 1 |
| 1 | -0.574777 | -0.531535 | 1 |
| 2 | -0.340687 | 1.088999 | 0 |
| 3 | -0.827379 | 0.203195 | 2 |
| 4 | -0.805437 | 0.244772 | 2 |
#plot for visualization
fig = go.Figure()
def cluster(cluster, color):
fig.add_trace(go.Scatter(
x=finalDf[finalDf.cluster == cluster]['Principal Component 1'],
y=finalDf[finalDf.cluster == cluster]['Principal Component 2'],
mode='markers',
name=f'Cluster {cluster}',
marker_color=color))
colors = {0: 'red', 1: 'green', 2: 'blue', 3: 'purple'}
for C in finalDf.cluster.unique():
    cluster(C, colors[C])
fig.add_trace(go.Scatter(x=model_fix_centers[:, 0], y=model_fix_centers[:, 1], mode='markers',
marker_color='rgba(152, 0, 0, .8)', marker_size=15, name='Centroids'))
fig.update_xaxes(title_text='Principal Component 1')
fig.update_yaxes(title_text='Principal Component 2')
fig.update_layout(height=900, width=1100,
legend_title_text='Clusters',
title_text='K-Medoids with 4 Clusters')
#creating new df for cluster analysis
df_cluster = pd.concat([df_cluster, df_model[['oneoff_proportion',
'installments_proportion', 'payments_proportion']]], axis=1)
#separating the dataframe by cluster
cluster_0 = df_cluster[df_cluster['cluster'] == 0]
cluster_1 = df_cluster[df_cluster['cluster'] == 1]
cluster_2 = df_cluster[df_cluster['cluster'] == 2]
cluster_3 = df_cluster[df_cluster['cluster'] == 3]
#summary for each cluster
need_columns = ['balance', 'purchases', 'cash_advance', 'credit_limit', 'payments']
d={}
for i in need_columns:
d[i] = pd.DataFrame(df_cluster.groupby('cluster')[i].describe()[['mean', '50%', 'min', 'max']])
pd.concat(d, axis =1)
| cluster | balance mean | balance 50% | balance min | balance max | purchases mean | purchases 50% | purchases min | purchases max | cash_advance mean | cash_advance 50% | cash_advance min | cash_advance max | credit_limit mean | credit_limit 50% | credit_limit min | credit_limit max | payments mean | payments 50% | payments min | payments max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1867.071485 | 975.679015 | 1.198223 | 19043.13856 | 2948.853477 | 1971.195 | 8.40 | 49039.57 | 661.414446 | 0.000000 | 0.0 | 29282.10915 | 6454.697329 | 6000.0 | 300.0 | 30000.0 | 3022.354523 | 1871.896384 | 0.0 | 46930.59824 |
| 1 | 2047.578186 | 1381.753281 | 0.000000 | 14581.45914 | 22.275111 | 0.000 | 0.00 | 3000.00 | 1845.579159 | 1090.520215 | 0.0 | 26194.04954 | 4033.301508 | 3000.0 | 50.0 | 19000.0 | 1608.335929 | 718.951070 | 0.0 | 34107.07499 |
| 2 | 1520.726547 | 918.797179 | 0.000000 | 15258.22590 | 739.299317 | 384.975 | 0.01 | 26784.62 | 935.488227 | 0.000000 | 0.0 | 20277.33112 | 4392.651217 | 3000.0 | 150.0 | 30000.0 | 1485.728802 | 732.711626 | 0.0 | 50721.48336 |
| 3 | 1045.056985 | 208.362536 | 0.000000 | 16304.88925 | 831.868663 | 462.860 | 4.44 | 22500.00 | 516.820451 | 0.000000 | 0.0 | 47137.21176 | 3797.320345 | 2700.0 | 300.0 | 23000.0 | 1261.209571 | 662.417377 | 0.0 | 40627.59524 |
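The `groupby`/`describe` pattern behind this table is worth a note: selecting `[['mean', '50%', 'min', 'max']]` from `describe()` and concatenating with a dict of keys is what produces one block of statistics per feature. A toy illustration with synthetic numbers (not the notebook's `df_cluster`):

```python
import pandas as pd

# Two toy clusters; 'balance' plays the role of one need_columns entry.
toy = pd.DataFrame({'cluster': [0, 0, 1, 1],
                    'balance': [100.0, 300.0, 50.0, 150.0]})
summary = {col: toy.groupby('cluster')[col].describe()[['mean', '50%', 'min', 'max']]
           for col in ['balance']}
table = pd.concat(summary, axis=1)
# Columns come out as a MultiIndex: ('balance', 'mean'), ('balance', '50%'), ...
print(table[('balance', 'mean')].tolist())  # [200.0, 100.0]
```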
#plot comparison balance from each cluster
plt.figure(figsize=(21,5))
for i, (c, color) in enumerate(zip([cluster_0, cluster_1, cluster_2, cluster_3],
                                   ['red', 'green', 'blue', 'purple'])):
    plt.subplot(1, 4, i + 1)
    sns.histplot(c['balance'], color=color)
    plt.title(f'cluster {i}', size=16)
plt.show()
#plot comparison purchases from each cluster
plt.figure(figsize=(21,5))
for i, (c, color) in enumerate(zip([cluster_0, cluster_1, cluster_2, cluster_3],
                                   ['red', 'green', 'blue', 'purple'])):
    plt.subplot(1, 4, i + 1)
    sns.histplot(c['purchases'], color=color)
    plt.title(f'cluster {i}', size=16)
plt.show()
#check purchases detail: one-off vs installments per cluster
plt.figure(figsize=(21,8))
purchases_by_method = pd.DataFrame({
    'Cluster': ['Cluster 0', 'Cluster 0', 'Cluster 1', 'Cluster 1',
                'Cluster 2', 'Cluster 2', 'Cluster 3', 'Cluster 3'],
    'Purchases': [cluster_0['oneoff_purchases'].sum(),
                  cluster_0['installments_purchases'].sum(),
                  cluster_1['oneoff_purchases'].sum(),
                  cluster_1['installments_purchases'].sum(),
                  cluster_2['oneoff_purchases'].sum(),
                  cluster_2['installments_purchases'].sum(),
                  cluster_3['oneoff_purchases'].sum(),
                  cluster_3['installments_purchases'].sum()],
    'Purchases Method': ['one off', 'installments'] * 4})
sns.barplot(x='Cluster', y='Purchases', data=purchases_by_method, hue='Purchases Method')
plt.ylabel('Total Purchases (Million)')
plt.xlabel('Cluster')
plt.title('Purchases by Method', size=20)
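The long-form frame above is typed out cluster by cluster; `pd.melt` can build the same shape from a wide table, which scales better if more clusters or payment methods are added later. A sketch with made-up sums standing in for the per-cluster totals:

```python
import pandas as pd

# Made-up totals standing in for the per-cluster sums computed above.
wide = pd.DataFrame({'Cluster': ['Cluster 0', 'Cluster 1'],
                     'one off': [100.0, 40.0],
                     'installments': [60.0, 80.0]})
long_form = wide.melt(id_vars='Cluster', var_name='Purchases Method',
                      value_name='Purchases')
print(len(long_form))  # 4 rows: 2 clusters x 2 methods
```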
#plot comparison cash advance from each cluster
plt.figure(figsize=(21,5))
for i, (c, color) in enumerate(zip([cluster_0, cluster_1, cluster_2, cluster_3],
                                   ['red', 'green', 'blue', 'purple'])):
    plt.subplot(1, 4, i + 1)
    sns.histplot(c['cash_advance'], color=color)
    plt.title(f'cluster {i}', size=16)
plt.show()
#plot comparison credit limit from each cluster
plt.figure(figsize=(21,5))
for i, (c, color) in enumerate(zip([cluster_0, cluster_1, cluster_2, cluster_3],
                                   ['red', 'green', 'blue', 'purple'])):
    plt.subplot(1, 4, i + 1)
    sns.histplot(c['credit_limit'], color=color)
    plt.title(f'cluster {i}', size=16)
plt.show()
#plot comparison payments from each cluster
plt.figure(figsize=(21,5))
for i, (c, color) in enumerate(zip([cluster_0, cluster_1, cluster_2, cluster_3],
                                   ['red', 'green', 'blue', 'purple'])):
    plt.subplot(1, 4, i + 1)
    sns.histplot(c['payments'], color=color)
    plt.title(f'cluster {i}', size=16)
plt.show()
#scatter of purchases vs balance, colored by cluster
plt.figure(figsize=(10,7))
sns.scatterplot(data = df_cluster, x = 'purchases', y = 'balance', hue = 'cluster', palette=['red', 'green', 'blue', 'purple'])
plt.show()
plt.figure(figsize=(25,10))
for i, (c, color) in enumerate(zip([cluster_0, cluster_1, cluster_2, cluster_3],
                                   ['red', 'green', 'blue', 'purple'])):
    plt.subplot(2, 4, i + 1)
    # seaborn >= 0.12 requires keyword arguments for x and y
    sns.scatterplot(x=c['purchases'], y=c['balance'], color=color)
    plt.title(f'cluster {i}', size=16)
plt.show()
#plot comparison purchases frequency from each cluster
plt.figure(figsize=(21,5))
for i, (c, color) in enumerate(zip([cluster_0, cluster_1, cluster_2, cluster_3],
                                   ['red', 'green', 'blue', 'purple'])):
    plt.subplot(1, 4, i + 1)
    sns.histplot(c['purchases_frequency'], color=color)
    plt.title(f'cluster {i}', size=16)
plt.show()
#plot comparison one-off purchases frequency from each cluster
plt.figure(figsize=(21,5))
for i, (c, color) in enumerate(zip([cluster_0, cluster_1, cluster_2, cluster_3],
                                   ['red', 'green', 'blue', 'purple'])):
    plt.subplot(1, 4, i + 1)
    sns.histplot(c['oneoff_purchases_frequency'], color=color)
    plt.title(f'cluster {i}', size=16)
plt.show()
#plot comparison installment purchases frequency from each cluster
plt.figure(figsize=(21,5))
for i, (c, color) in enumerate(zip([cluster_0, cluster_1, cluster_2, cluster_3],
                                   ['red', 'green', 'blue', 'purple'])):
    plt.subplot(1, 4, i + 1)
    sns.histplot(c['purchases_installments_frequency'], color=color)
    plt.title(f'cluster {i}', size=16)
plt.show()
#plot comparison cash advance frequency from each cluster
plt.figure(figsize=(21,5))
for i, (c, color) in enumerate(zip([cluster_0, cluster_1, cluster_2, cluster_3],
                                   ['red', 'green', 'blue', 'purple'])):
    plt.subplot(1, 4, i + 1)
    sns.histplot(c['cash_advance_frequency'], color=color)
    plt.title(f'cluster {i}', size=16)
plt.show()
#3d plot for one-off purchases frequency, installment purchases frequency, and cash advance frequency
fig = go.Figure()
def add_cluster_trace3d(cluster, color):
    fig.add_trace(go.Scatter3d(
        x=df_cluster[df_cluster.cluster == cluster]['oneoff_purchases_frequency'],
        y=df_cluster[df_cluster.cluster == cluster]['purchases_installments_frequency'],
        z=df_cluster[df_cluster.cluster == cluster]['cash_advance_frequency'],
        mode='markers',
        name=f'Cluster {cluster}',
        marker_color=color,
        marker_size=5))
cluster_colors = {0: 'red', 1: 'green', 2: 'blue', 3: 'purple'}
for c in sorted(df_cluster.cluster.unique()):
    if c in cluster_colors:
        add_cluster_trace3d(c, cluster_colors[c])
fig.update_layout(height=900, width=1100,
scene=dict(
xaxis_title='oneoff_purchases_frequency',
yaxis_title='purchases_installments_frequency',
zaxis_title='cash_advance_frequency'),
title_text='Purchase and Cash Advance Frequencies by Cluster')
Cluster 0 (Balance Spender)
Characteristics :
Analysis :
Users in this cluster prefer the one-off (single full payment) purchase method. We assume they use the credit card for routine monthly expenses such as electricity bills, groceries, and other monthly needs, so the card serves to defer payment rather than to spread it over installments. They may also be motivated by collecting credit card reward points. We assume the typical customer in this cluster is an adult with a family.
Strategy :
Cluster 1 (Money Hoarders)
Characteristics :
Analysis :
These users mostly have a high balance but almost never spend it, whether by credit card or otherwise. We can assume they use their money mostly for investing or treat the account as savings.
Strategy :
Cluster 2 (Potential Customer)
Characteristics :
Analysis :
Users in this cluster more often pay without installments. We assume they most likely purchase this way to avoid credit card interest.
Strategy :
Cluster 3 (Credit Lovers)
Characteristics :
Analysis :
Users in this cluster have a low balance and a small credit limit. They are fond of shopping, favor installment payments, and rarely pay in full. Our assumption is that these users are fresh graduates who have only recently obtained a credit card, hence the relatively small limit.
Strategy :
For the Data:
For modeling:
Save CSV for Tableau
save = pd.concat([df, pd.DataFrame(
{'cluster': model_fix_labels})], axis=1)
save.head()
| | cust_id | balance | balance_frequency | purchases | oneoff_purchases | installments_purchases | cash_advance | purchases_frequency | oneoff_purchases_frequency | purchases_installments_frequency | cash_advance_frequency | cash_advance_trx | purchases_trx | credit_limit | payments | minimum_payments | prc_full_payment | tenure | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C10001 | 40.900749 | 0.818182 | 95.40 | 0.00 | 95.4 | 0.000000 | 0.166667 | 0.000000 | 0.083333 | 0.000000 | 0.0 | 2.0 | 1000.0 | 201.802084 | 139.509787 | 0.000000 | 12.0 | 1 |
| 1 | C10002 | 3202.467416 | 0.909091 | 0.00 | 0.00 | 0.0 | 6442.945483 | 0.000000 | 0.000000 | 0.000000 | 0.250000 | 4.0 | 0.0 | 7000.0 | 4103.032597 | 1072.340217 | 0.222222 | 12.0 | 1 |
| 2 | C10003 | 2495.148862 | 1.000000 | 773.17 | 773.17 | 0.0 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.0 | 12.0 | 7500.0 | 622.066742 | 627.284787 | 0.000000 | 12.0 | 0 |
| 3 | C10004 | 1666.670542 | 0.636364 | 1499.00 | 1499.00 | 0.0 | 205.788017 | 0.083333 | 0.083333 | 0.000000 | 0.083333 | 1.0 | 1.0 | 7500.0 | 0.000000 | 431.220633 | 0.000000 | 12.0 | 2 |
| 4 | C10005 | 817.714335 | 1.000000 | 16.00 | 16.00 | 0.0 | 0.000000 | 0.083333 | 0.083333 | 0.000000 | 0.000000 | 0.0 | 1.0 | 1200.0 | 678.334763 | 244.791237 | 0.000000 | 12.0 | 2 |
save = pd.concat([save, df_model[['oneoff_proportion',
'installments_proportion', 'payments_proportion']]], axis=1)
save.head()
| | cust_id | balance | balance_frequency | purchases | oneoff_purchases | installments_purchases | cash_advance | purchases_frequency | oneoff_purchases_frequency | purchases_installments_frequency | ... | purchases_trx | credit_limit | payments | minimum_payments | prc_full_payment | tenure | cluster | oneoff_proportion | installments_proportion | payments_proportion |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C10001 | 40.900749 | 0.818182 | 95.40 | 0.00 | 95.4 | 0.000000 | 0.166667 | 0.000000 | 0.083333 | ... | 2.0 | 1000.0 | 201.802084 | 139.509787 | 0.000000 | 12.0 | 1 | 0.0 | 1.0 | 0.000000 |
| 1 | C10002 | 3202.467416 | 0.909091 | 0.00 | 0.00 | 0.0 | 6442.945483 | 0.000000 | 0.000000 | 0.000000 | ... | 0.0 | 7000.0 | 4103.032597 | 1072.340217 | 0.222222 | 12.0 | 1 | 0.0 | 0.0 | 0.222222 |
| 2 | C10003 | 2495.148862 | 1.000000 | 773.17 | 773.17 | 0.0 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | ... | 12.0 | 7500.0 | 622.066742 | 627.284787 | 0.000000 | 12.0 | 0 | 1.0 | 0.0 | 0.000000 |
| 3 | C10004 | 1666.670542 | 0.636364 | 1499.00 | 1499.00 | 0.0 | 205.788017 | 0.083333 | 0.083333 | 0.000000 | ... | 1.0 | 7500.0 | 0.000000 | 431.220633 | 0.000000 | 12.0 | 2 | 1.0 | 0.0 | 0.000000 |
| 4 | C10005 | 817.714335 | 1.000000 | 16.00 | 16.00 | 0.0 | 0.000000 | 0.083333 | 0.083333 | 0.000000 | ... | 1.0 | 1200.0 | 678.334763 | 244.791237 | 0.000000 | 12.0 | 2 | 1.0 | 0.0 | 0.000000 |
5 rows × 22 columns
save.drop('prc_full_payment', axis=1, inplace=True)
save
| | cust_id | balance | balance_frequency | purchases | oneoff_purchases | installments_purchases | cash_advance | purchases_frequency | oneoff_purchases_frequency | purchases_installments_frequency | ... | cash_advance_trx | purchases_trx | credit_limit | payments | minimum_payments | tenure | cluster | oneoff_proportion | installments_proportion | payments_proportion |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C10001 | 40.900749 | 0.818182 | 95.40 | 0.00 | 95.40 | 0.000000 | 0.166667 | 0.000000 | 0.083333 | ... | 0.0 | 2.0 | 1000.0 | 201.802084 | 139.509787 | 12.0 | 1 | 0.0 | 1.0 | 0.000000 |
| 1 | C10002 | 3202.467416 | 0.909091 | 0.00 | 0.00 | 0.00 | 6442.945483 | 0.000000 | 0.000000 | 0.000000 | ... | 4.0 | 0.0 | 7000.0 | 4103.032597 | 1072.340217 | 12.0 | 1 | 0.0 | 0.0 | 0.222222 |
| 2 | C10003 | 2495.148862 | 1.000000 | 773.17 | 773.17 | 0.00 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | ... | 0.0 | 12.0 | 7500.0 | 622.066742 | 627.284787 | 12.0 | 0 | 1.0 | 0.0 | 0.000000 |
| 3 | C10004 | 1666.670542 | 0.636364 | 1499.00 | 1499.00 | 0.00 | 205.788017 | 0.083333 | 0.083333 | 0.000000 | ... | 1.0 | 1.0 | 7500.0 | 0.000000 | 431.220633 | 12.0 | 2 | 1.0 | 0.0 | 0.000000 |
| 4 | C10005 | 817.714335 | 1.000000 | 16.00 | 16.00 | 0.00 | 0.000000 | 0.083333 | 0.083333 | 0.000000 | ... | 0.0 | 1.0 | 1200.0 | 678.334763 | 244.791237 | 12.0 | 2 | 1.0 | 0.0 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8945 | C19186 | 28.493517 | 1.000000 | 291.12 | 0.00 | 291.12 | 0.000000 | 1.000000 | 0.000000 | 0.833333 | ... | 0.0 | 6.0 | 1000.0 | 325.594462 | 48.886365 | 6.0 | 3 | 0.0 | 1.0 | 0.500000 |
| 8946 | C19187 | 19.183215 | 1.000000 | 300.00 | 0.00 | 300.00 | 0.000000 | 1.000000 | 0.000000 | 0.833333 | ... | 0.0 | 6.0 | 1000.0 | 275.861322 | 346.308639 | 6.0 | 3 | 0.0 | 1.0 | 0.000000 |
| 8947 | C19188 | 23.398673 | 0.833333 | 144.40 | 0.00 | 144.40 | 0.000000 | 0.833333 | 0.000000 | 0.666667 | ... | 0.0 | 5.0 | 1000.0 | 81.270775 | 82.418369 | 6.0 | 3 | 0.0 | 1.0 | 0.250000 |
| 8948 | C19189 | 13.457564 | 0.833333 | 0.00 | 0.00 | 0.00 | 36.558778 | 0.000000 | 0.000000 | 0.000000 | ... | 2.0 | 0.0 | 500.0 | 52.549959 | 55.755628 | 6.0 | 1 | 0.0 | 0.0 | 0.250000 |
| 8949 | C19190 | 372.708075 | 0.666667 | 1093.25 | 1093.25 | 0.00 | 127.040008 | 0.666667 | 0.666667 | 0.000000 | ... | 2.0 | 23.0 | 1200.0 | 63.165404 | 88.288956 | 6.0 | 2 | 1.0 | 0.0 | 0.000000 |
8950 rows × 21 columns
save = save[['cust_id', 'balance', 'balance_frequency',
'purchases', 'oneoff_purchases', 'installments_purchases', 'cash_advance',
'oneoff_proportion', 'installments_proportion',
'purchases_frequency', 'oneoff_purchases_frequency', 'purchases_installments_frequency', 'cash_advance_frequency',
'purchases_trx', 'cash_advance_trx',
'credit_limit', 'payments', 'minimum_payments', 'payments_proportion',
'tenure', 'cluster']]
cluster_names = {0: 'Balance Spender', 1: 'Money Hoarders',
                 2: 'Potential Customer', 3: 'Credit Lovers'}
save['Cluster Name'] = save['cluster'].map(cluster_names)
save.head()
| | cust_id | balance | balance_frequency | purchases | oneoff_purchases | installments_purchases | cash_advance | oneoff_proportion | installments_proportion | purchases_frequency | ... | cash_advance_frequency | purchases_trx | cash_advance_trx | credit_limit | payments | minimum_payments | payments_proportion | tenure | cluster | Cluster Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C10001 | 40.900749 | 0.818182 | 95.40 | 0.00 | 95.4 | 0.000000 | 0.0 | 1.0 | 0.166667 | ... | 0.000000 | 2.0 | 0.0 | 1000.0 | 201.802084 | 139.509787 | 0.000000 | 12.0 | 1 | Money Hoarders |
| 1 | C10002 | 3202.467416 | 0.909091 | 0.00 | 0.00 | 0.0 | 6442.945483 | 0.0 | 0.0 | 0.000000 | ... | 0.250000 | 0.0 | 4.0 | 7000.0 | 4103.032597 | 1072.340217 | 0.222222 | 12.0 | 1 | Money Hoarders |
| 2 | C10003 | 2495.148862 | 1.000000 | 773.17 | 773.17 | 0.0 | 0.000000 | 1.0 | 0.0 | 1.000000 | ... | 0.000000 | 12.0 | 0.0 | 7500.0 | 622.066742 | 627.284787 | 0.000000 | 12.0 | 0 | Balance Spender |
| 3 | C10004 | 1666.670542 | 0.636364 | 1499.00 | 1499.00 | 0.0 | 205.788017 | 1.0 | 0.0 | 0.083333 | ... | 0.083333 | 1.0 | 1.0 | 7500.0 | 0.000000 | 431.220633 | 0.000000 | 12.0 | 2 | Potential Customer |
| 4 | C10005 | 817.714335 | 1.000000 | 16.00 | 16.00 | 0.0 | 0.000000 | 1.0 | 0.0 | 0.083333 | ... | 0.000000 | 1.0 | 0.0 | 1200.0 | 678.334763 | 244.791237 | 0.000000 | 12.0 | 2 | Potential Customer |
5 rows × 22 columns
save.to_csv('CC with Cluster.csv', index=False)
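As a final sanity check on the export, a round trip through `to_csv(index=False)` / `read_csv` can confirm the cluster labels survive the file format; sketched here with two synthetic rows rather than the real frame:

```python
import os
import tempfile
import pandas as pd

# Two synthetic rows in the shape of the exported frame.
df_out = pd.DataFrame({'cust_id': ['C10001', 'C10002'],
                       'cluster': [1, 0],
                       'Cluster Name': ['Money Hoarders', 'Balance Spender']})
path = os.path.join(tempfile.mkdtemp(), 'CC with Cluster.csv')
df_out.to_csv(path, index=False)
back = pd.read_csv(path)
print(back.equals(df_out))  # True: values and dtypes round-trip intact
```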